Homework 1 - House Prices

Submitted by:

Iris Grabov & Shay Gelbart

About the task:

This project focuses on predicting house prices using a linear regression model built on a comprehensive housing dataset. The process involved exploring the data to understand its structure, cleaning and preprocessing to address missing values and outliers, and analyzing relationships between features and the target variable. By leveraging advanced techniques in feature engineering and model evaluation, we aimed to build an accurate and interpretable predictive model.

Imports

This code sets up the environment for data analysis and machine learning using libraries such as numpy, matplotlib, sklearn, and pandas. It customizes plot settings for readability and defines a threshold (min_correlation = 0.2) that is used later to filter out features with weak correlation to the target variable.

In [46]:
import math
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import sklearn
from sklearn import datasets
from sklearn import pipeline, preprocessing
from sklearn import metrics
from sklearn import linear_model
from sklearn import model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import seaborn as sns # we will use it for showing the regression line

# define plt settings 
plt.rcParams["font.size"] = 20
plt.rcParams["axes.labelsize"] = 20
plt.rcParams["xtick.labelsize"] = 20
plt.rcParams["ytick.labelsize"] = 20
plt.rcParams["legend.fontsize"] = 20
plt.rcParams["figure.figsize"] = (20,10)

min_correlation = 0.2

The code loads the training and test data from CSV files, resets their indices to the default integer range (0 to N-1), and displays the DataFrames.

In [47]:
train_df = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")
train_df.reset_index(drop=True, inplace=True) # reset the index to 0..N-1: drop=True discards the old index, inplace=True modifies the DataFrame rather than creating a new one
print("the train data:")
display(train_df)

test_df = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/test.csv")
test_df.reset_index(drop=True, inplace=True) # reset the index to 0..N-1: drop=True discards the old index, inplace=True modifies the DataFrame rather than creating a new one
print("the test data:")
display(test_df)
the train data:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 1456 60 RL 62.0 7917 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 8 2007 WD Normal 175000
1456 1457 20 RL 85.0 13175 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv NaN 0 2 2010 WD Normal 210000
1457 1458 70 RL 66.0 9042 Pave NaN Reg Lvl AllPub ... 0 NaN GdPrv Shed 2500 5 2010 WD Normal 266500
1458 1459 20 RL 68.0 9717 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2010 WD Normal 142125
1459 1460 20 RL 75.0 9937 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 6 2008 WD Normal 147500

1460 rows × 81 columns

the test data:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1454 2915 160 RM 21.0 1936 Pave NaN Reg Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2006 WD Normal
1455 2916 160 RM 21.0 1894 Pave NaN Reg Lvl AllPub ... 0 0 NaN NaN NaN 0 4 2006 WD Abnorml
1456 2917 20 RL 160.0 20000 Pave NaN Reg Lvl AllPub ... 0 0 NaN NaN NaN 0 9 2006 WD Abnorml
1457 2918 85 RL 62.0 10441 Pave NaN Reg Lvl AllPub ... 0 0 NaN MnPrv Shed 700 7 2006 WD Normal
1458 2919 60 RL 74.0 9627 Pave NaN Reg Lvl AllPub ... 0 0 NaN NaN NaN 0 11 2006 WD Normal

1459 rows × 80 columns

Overview of the columns

We have one table for train and one for test. The train table has 81 columns:

  1. Id
  2. 79 features
  3. the target, SalePrice (present only in the train set)

In the initial phase of our work, we will explore and analyze the data, examining the features to understand their impact on the target.

In [48]:
print("the columns in train:")
train_df.columns
the columns in train:
Out[48]:
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
In [49]:
print("the columns in test:")
test_df.columns
the columns in test:
Out[49]:
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition'],
      dtype='object')
In [50]:
print("the train describe:")
train_df.describe()
the train describe:
Out[50]:
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
count 1460.000000 1460.000000 1201.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1452.000000 1460.000000 ... 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000
mean 730.500000 56.897260 70.049958 10516.828082 6.099315 5.575342 1971.267808 1984.865753 103.685262 443.639726 ... 94.244521 46.660274 21.954110 3.409589 15.060959 2.758904 43.489041 6.321918 2007.815753 180921.195890
std 421.610009 42.300571 24.284752 9981.264932 1.382997 1.112799 30.202904 20.645407 181.066207 456.098091 ... 125.338794 66.256028 61.119149 29.317331 55.757415 40.177307 496.123024 2.703626 1.328095 79442.502883
min 1.000000 20.000000 21.000000 1300.000000 1.000000 1.000000 1872.000000 1950.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 2006.000000 34900.000000
25% 365.750000 20.000000 59.000000 7553.500000 5.000000 5.000000 1954.000000 1967.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000 2007.000000 129975.000000
50% 730.500000 50.000000 69.000000 9478.500000 6.000000 5.000000 1973.000000 1994.000000 0.000000 383.500000 ... 0.000000 25.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 2008.000000 163000.000000
75% 1095.250000 70.000000 80.000000 11601.500000 7.000000 6.000000 2000.000000 2004.000000 166.000000 712.250000 ... 168.000000 68.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.000000 2009.000000 214000.000000
max 1460.000000 190.000000 313.000000 215245.000000 10.000000 9.000000 2010.000000 2010.000000 1600.000000 5644.000000 ... 857.000000 547.000000 552.000000 508.000000 480.000000 738.000000 15500.000000 12.000000 2010.000000 755000.000000

8 rows × 38 columns

Features type

Our data contains features of different types: numerical, categorical, and ordinal. We want to understand more about our data before modeling.

In [51]:
categorical = train_df.dtypes[train_df.dtypes == "object"].index
print("Number of Categorical features: ", len(categorical))

numerical = train_df.dtypes[train_df.dtypes != "object"].index
print("Number of Numerical features: ", len(numerical))
Number of Categorical features:  43
Number of Numerical features:  38
In [52]:
print("numerical:")
print(train_df[numerical].columns)
print("\ncategorical:")
print(train_df[categorical].columns)
numerical:
Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')

categorical:
Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition'],
      dtype='object')

Preprocessing

Checking for missing values

Empty values can appear as '' in string columns or as NaN.

When we get a new dataset, a common strategy is to fill the empty values with the feature's median or mean, or to remove the affected rows/columns if we have enough data.
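As a minimal sketch of these three options (on a hypothetical toy column, not the actual dataset):

```python
import numpy as np
import pandas as pd

# Toy column standing in for a real feature such as LotFrontage
df = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0, np.nan, 70.0]})

# Option 1: fill with the median (robust to outliers)
median_filled = df["LotFrontage"].fillna(df["LotFrontage"].median())

# Option 2: fill with the mean
mean_filled = df["LotFrontage"].fillna(df["LotFrontage"].mean())

# Option 3: drop the rows with missing values (only if enough data remains)
dropped = df.dropna(subset=["LotFrontage"])

print(median_filled.isna().sum(), mean_filled.isna().sum(), len(dropped))  # 0 0 3
```

In this notebook we fill numerical features with the mean; dropping rows is avoided because the dataset is small.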

In [53]:
# check for Nan values in the dataset
print("Nan values in the train:")
train_df.isna().any() # check if there are Nan values
Nan values in the train:
Out[53]:
Id               False
MSSubClass       False
MSZoning         False
LotFrontage       True
LotArea          False
                 ...  
MoSold           False
YrSold           False
SaleType         False
SaleCondition    False
SalePrice        False
Length: 81, dtype: bool

This dataset does contain missing values, for example in LotFrontage.

Let's check the types of the columns.

In [54]:
# check columns type, if the dataset has mix types. type is an object
train_df.dtypes
Out[54]:
Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
                  ...   
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Length: 81, dtype: object

This output shows the data type of each column in the DataFrame:

  • int64: columns containing integer values. For example:

    • Id, MSSubClass, LotArea, MoSold, YrSold, and SalePrice are integer columns.
  • float64: columns containing floating-point (decimal) numbers, such as:

    • LotFrontage is a float column.
  • object: columns containing categorical data, usually strings. For example:

    • MSZoning, SaleType, SaleCondition are categorical columns, which can store values like labels or categories.
  • Length: 81: the DataFrame has 81 columns in total.

In [55]:
# display the dataset info, count, Nan, columns type, etc.
print("train info:")
train_df.info()
print("\ntest info:")
test_df.info()
train info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460 non-null   int64  
 19  YearBuilt      1460 non-null   int64  
 20  YearRemodAdd   1460 non-null   int64  
 21  RoofStyle      1460 non-null   object 
 22  RoofMatl       1460 non-null   object 
 23  Exterior1st    1460 non-null   object 
 24  Exterior2nd    1460 non-null   object 
 25  MasVnrType     588 non-null    object 
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object 
 28  ExterCond      1460 non-null   object 
 29  Foundation     1460 non-null   object 
 30  BsmtQual       1423 non-null   object 
 31  BsmtCond       1423 non-null   object 
 32  BsmtExposure   1422 non-null   object 
 33  BsmtFinType1   1423 non-null   object 
 34  BsmtFinSF1     1460 non-null   int64  
 35  BsmtFinType2   1422 non-null   object 
 36  BsmtFinSF2     1460 non-null   int64  
 37  BsmtUnfSF      1460 non-null   int64  
 38  TotalBsmtSF    1460 non-null   int64  
 39  Heating        1460 non-null   object 
 40  HeatingQC      1460 non-null   object 
 41  CentralAir     1460 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1460 non-null   int64  
 44  2ndFlrSF       1460 non-null   int64  
 45  LowQualFinSF   1460 non-null   int64  
 46  GrLivArea      1460 non-null   int64  
 47  BsmtFullBath   1460 non-null   int64  
 48  BsmtHalfBath   1460 non-null   int64  
 49  FullBath       1460 non-null   int64  
 50  HalfBath       1460 non-null   int64  
 51  BedroomAbvGr   1460 non-null   int64  
 52  KitchenAbvGr   1460 non-null   int64  
 53  KitchenQual    1460 non-null   object 
 54  TotRmsAbvGrd   1460 non-null   int64  
 55  Functional     1460 non-null   object 
 56  Fireplaces     1460 non-null   int64  
 57  FireplaceQu    770 non-null    object 
 58  GarageType     1379 non-null   object 
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object 
 61  GarageCars     1460 non-null   int64  
 62  GarageArea     1460 non-null   int64  
 63  GarageQual     1379 non-null   object 
 64  GarageCond     1379 non-null   object 
 65  PavedDrive     1460 non-null   object 
 66  WoodDeckSF     1460 non-null   int64  
 67  OpenPorchSF    1460 non-null   int64  
 68  EnclosedPorch  1460 non-null   int64  
 69  3SsnPorch      1460 non-null   int64  
 70  ScreenPorch    1460 non-null   int64  
 71  PoolArea       1460 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          281 non-null    object 
 74  MiscFeature    54 non-null     object 
 75  MiscVal        1460 non-null   int64  
 76  MoSold         1460 non-null   int64  
 77  YrSold         1460 non-null   int64  
 78  SaleType       1460 non-null   object 
 79  SaleCondition  1460 non-null   object 
 80  SalePrice      1460 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

test info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSSubClass     1459 non-null   int64  
 2   MSZoning       1455 non-null   object 
 3   LotFrontage    1232 non-null   float64
 4   LotArea        1459 non-null   int64  
 5   Street         1459 non-null   object 
 6   Alley          107 non-null    object 
 7   LotShape       1459 non-null   object 
 8   LandContour    1459 non-null   object 
 9   Utilities      1457 non-null   object 
 10  LotConfig      1459 non-null   object 
 11  LandSlope      1459 non-null   object 
 12  Neighborhood   1459 non-null   object 
 13  Condition1     1459 non-null   object 
 14  Condition2     1459 non-null   object 
 15  BldgType       1459 non-null   object 
 16  HouseStyle     1459 non-null   object 
 17  OverallQual    1459 non-null   int64  
 18  OverallCond    1459 non-null   int64  
 19  YearBuilt      1459 non-null   int64  
 20  YearRemodAdd   1459 non-null   int64  
 21  RoofStyle      1459 non-null   object 
 22  RoofMatl       1459 non-null   object 
 23  Exterior1st    1458 non-null   object 
 24  Exterior2nd    1458 non-null   object 
 25  MasVnrType     565 non-null    object 
 26  MasVnrArea     1444 non-null   float64
 27  ExterQual      1459 non-null   object 
 28  ExterCond      1459 non-null   object 
 29  Foundation     1459 non-null   object 
 30  BsmtQual       1415 non-null   object 
 31  BsmtCond       1414 non-null   object 
 32  BsmtExposure   1415 non-null   object 
 33  BsmtFinType1   1417 non-null   object 
 34  BsmtFinSF1     1458 non-null   float64
 35  BsmtFinType2   1417 non-null   object 
 36  BsmtFinSF2     1458 non-null   float64
 37  BsmtUnfSF      1458 non-null   float64
 38  TotalBsmtSF    1458 non-null   float64
 39  Heating        1459 non-null   object 
 40  HeatingQC      1459 non-null   object 
 41  CentralAir     1459 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1459 non-null   int64  
 44  2ndFlrSF       1459 non-null   int64  
 45  LowQualFinSF   1459 non-null   int64  
 46  GrLivArea      1459 non-null   int64  
 47  BsmtFullBath   1457 non-null   float64
 48  BsmtHalfBath   1457 non-null   float64
 49  FullBath       1459 non-null   int64  
 50  HalfBath       1459 non-null   int64  
 51  BedroomAbvGr   1459 non-null   int64  
 52  KitchenAbvGr   1459 non-null   int64  
 53  KitchenQual    1458 non-null   object 
 54  TotRmsAbvGrd   1459 non-null   int64  
 55  Functional     1457 non-null   object 
 56  Fireplaces     1459 non-null   int64  
 57  FireplaceQu    729 non-null    object 
 58  GarageType     1383 non-null   object 
 59  GarageYrBlt    1381 non-null   float64
 60  GarageFinish   1381 non-null   object 
 61  GarageCars     1458 non-null   float64
 62  GarageArea     1458 non-null   float64
 63  GarageQual     1381 non-null   object 
 64  GarageCond     1381 non-null   object 
 65  PavedDrive     1459 non-null   object 
 66  WoodDeckSF     1459 non-null   int64  
 67  OpenPorchSF    1459 non-null   int64  
 68  EnclosedPorch  1459 non-null   int64  
 69  3SsnPorch      1459 non-null   int64  
 70  ScreenPorch    1459 non-null   int64  
 71  PoolArea       1459 non-null   int64  
 72  PoolQC         3 non-null      object 
 73  Fence          290 non-null    object 
 74  MiscFeature    51 non-null     object 
 75  MiscVal        1459 non-null   int64  
 76  MoSold         1459 non-null   int64  
 77  YrSold         1459 non-null   int64  
 78  SaleType       1458 non-null   object 
 79  SaleCondition  1459 non-null   object 
dtypes: float64(11), int64(26), object(43)
memory usage: 912.0+ KB

We want to understand better which values are missing and how many there are, so we can handle the data appropriately.

In [56]:
# Calculate the total number of missing values in each column and sort in descending order
missing_total = train_df.isnull().sum().sort_values(ascending=False)

# Calculate the percentage of missing values for each column and sort in descending order
missing_percentage = (train_df.isnull().sum() / len(train_df)).sort_values(ascending=False)

# Combine the total and percentage of missing values into a single DataFrame
missing_info = pd.concat([missing_total, missing_percentage], axis=1, keys=['Total Missing', 'Percentage'])

# Display the top 20 columns with the most missing data
missing_info.head(20)
Out[56]:
Total Missing Percentage
PoolQC 1453 0.995205
MiscFeature 1406 0.963014
Alley 1369 0.937671
Fence 1179 0.807534
MasVnrType 872 0.597260
FireplaceQu 690 0.472603
LotFrontage 259 0.177397
GarageYrBlt 81 0.055479
GarageCond 81 0.055479
GarageType 81 0.055479
GarageFinish 81 0.055479
GarageQual 81 0.055479
BsmtFinType2 38 0.026027
BsmtExposure 38 0.026027
BsmtQual 37 0.025342
BsmtCond 37 0.025342
BsmtFinType1 37 0.025342
MasVnrArea 8 0.005479
Electrical 1 0.000685
Id 0 0.000000

For the numerical features we fill the missing data with the mean. The numerical features that have missing data are:

  • LotFrontage: frontage of the lot in feet.
  • MasVnrArea: area of masonry veneer in square feet.
  • GarageYrBlt: the year the garage was built.

In [57]:
# Fill missing values with the mean for specific columns
columns_to_fill = ['LotFrontage', 'GarageYrBlt', 'MasVnrArea']

# Apply mean to the specified columns for training and testing data
train_df[columns_to_fill] = train_df[columns_to_fill].fillna(train_df[columns_to_fill].mean())
test_df[columns_to_fill] = test_df[columns_to_fill].fillna(test_df[columns_to_fill].mean())

After imputing the missing numerical features with their means, we still need to handle the remaining missing data. There are different approaches. Taking the PoolQC feature as an example, it would not be correct to remove all the rows with no pool, because this would discard most of the data.

In [58]:
# List of columns where missing values have a specific meaning (e.g., "None")
cols_with_meaningful_nan = [
    'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'MasVnrType', 'FireplaceQu',
    'GarageQual', 'GarageCond', 'GarageFinish', 'GarageType', 'Electrical',
    'KitchenQual', 'SaleType', 'Functional', 'Exterior2nd', 'Exterior1st',
    'BsmtExposure', 'BsmtCond', 'BsmtQual', 'BsmtFinType1', 'BsmtFinType2',
    'MSZoning', 'Utilities'
]

for column in cols_with_meaningful_nan:
    most_frequent_value = train_df[column].mode()[0]

    train_df[column] = train_df[column].fillna(most_frequent_value)
    test_df[column] = test_df[column].fillna(most_frequent_value)

After all that, let's check whether we still have missing data.

In [59]:
train_df.isnull().sum().sum()
Out[59]:
0
In [60]:
test_df.isnull().sum().sum()
Out[60]:
10
In [61]:
print(test_df.isnull().sum()[test_df.isnull().sum() > 0])
BsmtFinSF1      1
BsmtFinSF2      1
BsmtUnfSF       1
TotalBsmtSF     1
BsmtFullBath    2
BsmtHalfBath    2
GarageCars      1
GarageArea      1
dtype: int64

Explanation of each remaining missing value in the test set:

  1. BsmtFinSF1: 1 – square footage of finished basement area (type 1); 1 missing value.
  2. BsmtFinSF2: 1 – square footage of finished basement area (type 2); 1 missing value.
  3. BsmtUnfSF: 1 – square footage of unfinished basement area; 1 missing value.
  4. TotalBsmtSF: 1 – total basement square footage; 1 missing value.
  5. BsmtFullBath: 2 – number of full bathrooms in the basement; 2 missing values.
  6. BsmtHalfBath: 2 – number of half bathrooms in the basement; 2 missing values.
  7. GarageCars: 1 – garage capacity in number of cars; 1 missing value.
  8. GarageArea: 1 – total garage area; 1 missing value.
In [62]:
# Fill missing values with the mean for specific columns
columns_to_fill = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath', 'GarageCars', 'GarageArea']


# Apply mean to the specified columns for testing data
test_df[columns_to_fill] = test_df[columns_to_fill].fillna(test_df[columns_to_fill].mean())

The final check:

In [63]:
test_df.isnull().sum().sum()
Out[63]:
0

Overview

In [64]:
train_df.columns
Out[64]:
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')

Transformation

The code calculates the age of the house and garage by subtracting the year they were built from the current year (2024). This transformation makes the features more relevant for modeling, as the age of the property is typically more useful than the year it was built.

In [65]:
for df in [train_df,test_df]:
    df['YearBuilt'] = 2024 - df['YearBuilt']
    df['GarageYrBlt'] = 2024 - df['GarageYrBlt']

Relation of numerical features to target

The goal of this code is to visually inspect the relationships between various numerical features and the target variable (SalePrice), while also displaying the correlation and p-value for each feature. This helps to understand the strength and significance of these relationships for feature selection.

In [66]:
# Number of rows and columns in the subplot grid
nr_rows = 15
nr_cols = 2

# Create the figure and axes for the subplots
fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols * 6, nr_rows * 6))

# List of numerical features, excluding the target and 'Id' columns
li_numerical = list(numerical)
li_not = ['Id', 'SalePrice']
#li_plot_numerical = [c for c in li_numerical if c not in li_not]
li_plot_numerical = [col for col in li_numerical if train_df[col].nunique() > 1]

# Create a color palette
colors = sns.color_palette("viridis", n_colors=len(li_plot_numerical))

# Loop through each subplot and create the regression plots
for r in range(nr_rows):
    for c in range(nr_cols):
        i = r * nr_cols + c
        if i < len(li_plot_numerical):
            # Create the regression plot for each feature vs the target 'SalePrice'
            sns.regplot(
                x=train_df[li_plot_numerical[i]], 
                y=train_df['SalePrice'], 
                ax=axs[r][c],
                scatter_kws={"s": 40, "color": colors[i]},  # Set the scatter point color
                line_kws={"color": colors[i]},  # Set the regression line color
            )
            
            # Calculate the Pearson correlation coefficient and p-value
            stp = stats.pearsonr(train_df[li_plot_numerical[i]], train_df['SalePrice'])
            
            # Format the title with the correlation and p-value
            str_title = f"r = {stp[0]:.2f}      p = {stp[1]:.2f}"
            axs[r][c].set_title(str_title, fontsize=8)

# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()

To select the features, we keep those whose absolute correlation with SalePrice is above the threshold min_correlation = 0.2.

You can adjust it later based on model performance and the number of features.

In [67]:
# Ensure you have only numeric columns for correlation calculation
numeric_df = train_df.select_dtypes(include=['number'])

# Calculate the correlation matrix
corr = numeric_df.corr()

# Get the absolute values of the correlation matrix
corr_abs = corr.abs()

# Number of numerical columns (excluding the target)
nr_num_cols = len(numerical)

# Get the correlation of all numerical features with 'SalePrice'
ser_corr = corr_abs['SalePrice'].nlargest(nr_num_cols)

# Select features whose correlation is above the min_val_corr threshold
cols_abv_corr_limit = list(ser_corr[ser_corr.values > min_correlation].index)

# Select features whose correlation is below the min_val_corr threshold
cols_below_corr_limit = list(ser_corr[ser_corr.values <= min_correlation].index)

# Print the list of features with correlation above the threshold
print("List of numerical features above min correlation:\n")
print(cols_abv_corr_limit)
List of numerical features above min correlation:

['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', 'TotalBsmtSF', '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'GarageYrBlt', 'Fireplaces', 'BsmtFinSF1', 'LotFrontage', 'WoodDeckSF', '2ndFlrSF', 'OpenPorchSF', 'HalfBath', 'LotArea', 'BsmtFullBath', 'BsmtUnfSF']

Relation of categorical features to target

This visualization helps understand the distribution of SalePrice for each categorical feature, highlighting differences between categories and any potential outliers.

In [68]:
# List of categorical features
cat_features = list(categorical)

# Set the grid size for the subplots
rows = 22
cols = 2

# Create a figure with the defined grid size
fig, axes = plt.subplots(rows, cols, figsize=(cols * 6, rows * 6))

# Iterate through each subplot to create a boxplot
for row in range(rows):
    for col in range(cols):
        index = row * cols + col
        if index < len(cat_features):
            sns.boxplot(x=cat_features[index], y='SalePrice', data=train_df, ax=axes[row][col])

            # Set font size for axis labels, ticks, and titles
            axes[row][col].tick_params(labelsize=10)  # smaller tick labels
            axes[row][col].set_xlabel(axes[row][col].get_xlabel(), fontsize=10)  # smaller x-axis labels
            axes[row][col].set_ylabel(axes[row][col].get_ylabel(), fontsize=10)  # smaller y-axis labels
            axes[row][col].set_title(axes[row][col].get_title(), fontsize=12)  # smaller title

# Adjust the layout to prevent overlapping
plt.tight_layout()
plt.show()

For some features a strong relationship with SalePrice is easy to spot: 'Neighborhood', 'Electrical', 'ExterQual', 'MasVnrType', 'BsmtQual', 'MSZoning', 'CentralAir', 'Condition2', 'KitchenQual', 'SaleType'. Others, such as 'Street', show only a weak relation.

In [69]:
catgerical_strong_correlation = ['Neighborhood', 'Electrical', 'ExterQual', 'MasVnrType', 'BsmtQual', 
                     'MSZoning', 'CentralAir', 'Condition2', 'KitchenQual', 'SaleType']

catgerical_weak_correlation = ['SaleCondition', 'MiscFeature', 'Fence', 'PoolQC', 'PavedDrive', 
                   'GarageCond', 'GarageQual', 'GarageFinish', 'GarageType', 'FireplaceQu', 
                   'Functional', 'HeatingQC', 'Heating', 'BsmtFinType2', 'BsmtFinType1', 
                   'BsmtExposure', 'BsmtCond', 'Foundation', 'ExterCond', 'Exterior2nd', 
                   'Exterior1st', 'RoofMatl', 'RoofStyle', 'HouseStyle', 'BldgType', 
                   'Condition1', 'LandSlope', 'LotConfig', 'Utilities', 'LandContour', 
                   'LotShape', 'Alley', 'Street']

Correlation Matrix

The heatmap visually represents the absolute correlation between numerical features, with darker red indicating a stronger correlation. This helps identify which features are highly correlated with each other, which is useful for feature selection and for spotting multicollinearity.

In [70]:
# Show absolute correlation between numerical features in a heatmap
plt.figure(figsize=(17,17))
cor = np.abs(train_df[cols_abv_corr_limit].corr())  # Use only numerical features

# Create the heatmap with smaller annotations
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds, vmin=0, vmax=1, annot_kws={"size": 8})  # Change the font size of annotations

# Display the plot
plt.show()

A correlation heatmap helps identify relationships between features and the target variable. Columns are dropped based on the correlation threshold for numerical features and on weak correlation for categorical features:

In [71]:
# Extract the 'Id' column from df_test
id_test = test_df['Id']

# Define the columns to drop based on correlation limits and weak correlation for categorical variables
to_drop_num = cols_below_corr_limit
to_drop_catg = catgerical_weak_correlation

# Create a list of columns to drop, including 'Id'
cols_to_drop = ['Id'] + to_drop_num + to_drop_catg

# Drop the specified columns from both train and test dataframes
for df in [train_df,test_df]:
    df.drop(cols_to_drop, inplace=True, axis=1)

Converting categorical features into numerical features

  1. Label Encoding: Transform the categorical data into ordinal data by translating each category to an integer. This is appropriate when the values have a natural order, or when there are too many distinct values to handle otherwise. To understand the categories better, sns.violinplot is useful: it shows the distribution of quantitative data across the levels of one (or more) categorical variables so those distributions can be compared.

  2. One-Hot Encoding: Transform the categorical data into several binary columns, translating each category into a column of 0/1 values (1 if the row takes that categorical value, 0 otherwise). This is appropriate when the values have no natural order and the column does not have too many distinct values. With regularized regressions, one redundant column is usually dropped to avoid collinearity issues.

overview:

In [72]:
# Set up the figure and axes
plt.figure(figsize=(16, 5))

# Create a violin plot for 'Neighborhood' against the target
sns.violinplot(x='Neighborhood', y='SalePrice', data=train_df)

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Add a title for the plot
plt.title('Violin Plot of Neighborhood vs. SalePrice', fontsize=14)

# Display the plot
plt.tight_layout()
plt.show()
In [73]:
catg_list = catgerical_strong_correlation.copy()
#only for the printing
catg_list.remove('Neighborhood')

for catg in catg_list :
    sns.violinplot(x=catg, y='SalePrice', data=train_df)
    plt.show()

Understand better the features:

In [74]:
# Identify non-numeric columns
non_numeric_columns = train_df.select_dtypes(exclude=['number']).columns

# Iterate through non-numeric columns and print unique values
for col in non_numeric_columns:
    print(f"Column: {col}")
    print(train_df[col].unique())
    print("-" * 50)
Column: MSZoning
['RL' 'RM' 'C (all)' 'FV' 'RH']
--------------------------------------------------
Column: Neighborhood
['CollgCr' 'Veenker' 'Crawfor' 'NoRidge' 'Mitchel' 'Somerst' 'NWAmes'
 'OldTown' 'BrkSide' 'Sawyer' 'NridgHt' 'NAmes' 'SawyerW' 'IDOTRR'
 'MeadowV' 'Edwards' 'Timber' 'Gilbert' 'StoneBr' 'ClearCr' 'NPkVill'
 'Blmngtn' 'BrDale' 'SWISU' 'Blueste']
--------------------------------------------------
Column: Condition2
['Norm' 'Artery' 'RRNn' 'Feedr' 'PosN' 'PosA' 'RRAn' 'RRAe']
--------------------------------------------------
Column: MasVnrType
['BrkFace' 'Stone' 'BrkCmn']
--------------------------------------------------
Column: ExterQual
['Gd' 'TA' 'Ex' 'Fa']
--------------------------------------------------
Column: BsmtQual
['Gd' 'TA' 'Ex' 'Fa']
--------------------------------------------------
Column: CentralAir
['Y' 'N']
--------------------------------------------------
Column: Electrical
['SBrkr' 'FuseF' 'FuseA' 'FuseP' 'Mix']
--------------------------------------------------
Column: KitchenQual
['Gd' 'TA' 'Ex' 'Fa']
--------------------------------------------------
Column: SaleType
['WD' 'New' 'COD' 'ConLD' 'ConLI' 'CWD' 'ConLw' 'Con' 'Oth']
--------------------------------------------------

These columns represent categorical features in the dataset, each containing distinct categories or values. Here's a brief explanation of each:

  • MSZoning: The general zoning classification of the property (e.g., 'RL' for Residential Low Density).
  • Neighborhood: The neighborhood in which the house is located, representing different residential areas.
  • Condition2: The condition of the property relative to nearby roads or streets (e.g., 'Norm' for normal).
  • MasVnrType: The type of masonry veneer on the house (e.g., 'BrkFace' for brick front).
  • ExterQual: The quality of the exterior material of the house (e.g., 'Gd' for good).
  • BsmtQual: The quality of the basement (e.g., 'Gd' for good).
  • CentralAir: Whether the house has central air conditioning (e.g., 'Y' for Yes).
  • Electrical: The type of electrical system (e.g., 'SBrkr' for circuit breakers).
  • KitchenQual: The quality of the kitchen (e.g., 'Gd' for good).
  • SaleType: The type of sale (e.g., 'WD' for warranty deed).

Label Encoding:

In [75]:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Define ordinal categories and their order
ordinal_categories = {
    'KitchenQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'ExterQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'BsmtQual': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'CentralAir': ['N', 'Y']
}

# Define nominal columns
nominal_categories = ['MSZoning', 'Neighborhood', 'Condition2', 'MasVnrType', 'Electrical', 'SaleType']

# Step 1: Ordinal Encoding
# Extract only the ordinal columns
ordinal_encoder = OrdinalEncoder(categories=[ordinal_categories[col] for col in ordinal_categories])

# Fit and transform the ordinal columns
train_df[list(ordinal_categories.keys())] = ordinal_encoder.fit_transform(train_df[list(ordinal_categories.keys())])
test_df[list(ordinal_categories.keys())] = ordinal_encoder.transform(test_df[list(ordinal_categories.keys())])

One-Hot Encoding:

In [76]:
# Step 2: One-Hot Encoding

# Initialize OneHotEncoder
one_hot_encoder = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')

# Fit the encoder on the nominal columns
encoded_nominal_train = one_hot_encoder.fit_transform(train_df[nominal_categories])
encoded_nominal_test = one_hot_encoder.transform(test_df[nominal_categories])

# Convert the one-hot encoded data into DataFrames
encoded_train_nominal_df = pd.DataFrame(encoded_nominal_train, columns=one_hot_encoder.get_feature_names_out(nominal_categories))
encoded_test_nominal_df = pd.DataFrame(encoded_nominal_test, columns=one_hot_encoder.get_feature_names_out(nominal_categories))

# Reset index to match train and test DataFrame indices
encoded_train_nominal_df.index = train_df.index
encoded_test_nominal_df.index = test_df.index

# Drop the original nominal columns and concatenate the encoded DataFrames
train_df = pd.concat([train_df.drop(columns=nominal_categories), encoded_train_nominal_df], axis=1)
test_df = pd.concat([test_df.drop(columns=nominal_categories), encoded_test_nominal_df], axis=1)
In [77]:
train_df
Out[77]:
LotFrontage LotArea OverallQual YearBuilt YearRemodAdd MasVnrArea ExterQual BsmtQual BsmtFinSF1 BsmtUnfSF ... Electrical_Mix Electrical_SBrkr SaleType_CWD SaleType_Con SaleType_ConLD SaleType_ConLI SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD
0 65.0 8450 7 21 2003 196.0 3.0 4.0 706 150 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1 80.0 9600 6 48 1976 0.0 2.0 4.0 978 284 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
2 68.0 11250 7 23 2002 162.0 3.0 4.0 486 434 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 60.0 9550 7 109 1970 0.0 2.0 3.0 216 540 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
4 84.0 14260 8 24 2000 350.0 3.0 4.0 655 490 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 62.0 7917 6 25 2000 0.0 2.0 4.0 0 953 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1456 85.0 13175 6 46 1988 119.0 2.0 4.0 790 589 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1457 66.0 9042 7 83 2006 0.0 4.0 3.0 275 877 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1458 68.0 9717 5 74 1996 0.0 2.0 3.0 49 0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1459 75.0 9937 5 59 1965 0.0 3.0 3.0 830 136 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0

1460 rows × 76 columns

Remove outliers

This code defines a function, remove_outliers, that removes rows from a DataFrame where the target feature (SalePrice) contains outliers based on the Interquartile Range (IQR) method.

Steps:

  1. Compute IQR:

    • Q1: 25th percentile of SalePrice.
    • Q3: 75th percentile of SalePrice.
    • IQR: Difference between Q3 and Q1 (Q3 - Q1).
  2. Set Outlier Bounds:

    • Lower Bound: Q1 - 1.5 * IQR.
    • Upper Bound: Q3 + 1.5 * IQR.
  3. Filter Rows:

    • Keeps rows where SalePrice is within bounds (lower_bound ≤ SalePrice ≤ upper_bound).
  4. Remove Outliers:

    • Applies the mask to the dataset, returning a filtered DataFrame without outliers.

Finally, it removes outliers from train_df based on the SalePrice feature.

In [78]:
def remove_outliers(df, target_feature):

    # Compute the IQR of the target feature
    Q1 = target_feature.quantile(0.25)
    Q3 = target_feature.quantile(0.75)
    IQR = Q3 - Q1

    # Compute the outlier bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Keep only the rows whose values fall within the bounds
    mask = (target_feature >= lower_bound) & (target_feature <= upper_bound)
    return df[mask]


train_df = remove_outliers(train_df, train_df["SalePrice"])

train_df
Out[78]:
LotFrontage LotArea OverallQual YearBuilt YearRemodAdd MasVnrArea ExterQual BsmtQual BsmtFinSF1 BsmtUnfSF ... Electrical_Mix Electrical_SBrkr SaleType_CWD SaleType_Con SaleType_ConLD SaleType_ConLI SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD
0 65.0 8450 7 21 2003 196.0 3.0 4.0 706 150 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1 80.0 9600 6 48 1976 0.0 2.0 4.0 978 284 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
2 68.0 11250 7 23 2002 162.0 3.0 4.0 486 434 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 60.0 9550 7 109 1970 0.0 2.0 3.0 216 540 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
4 84.0 14260 8 24 2000 350.0 3.0 4.0 655 490 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 62.0 7917 6 25 2000 0.0 2.0 4.0 0 953 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1456 85.0 13175 6 46 1988 119.0 2.0 4.0 790 589 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1457 66.0 9042 7 83 2006 0.0 4.0 3.0 275 877 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1458 68.0 9717 5 74 1996 0.0 2.0 3.0 49 0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1459 75.0 9937 5 59 1965 0.0 3.0 3.0 830 136 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0

1399 rows × 76 columns

Hyperparameter Optimization with PCA for KNN and Decision Tree

This script performs Principal Component Analysis (PCA) and hyperparameter tuning to find the best settings for K-Nearest Neighbors (KNN) Regression and Decision Tree Regression using GridSearchCV.

1. Data Preparation

  • The dataset is split into features (X) and the target variable (t).
  • Train-validation split is performed using train_test_split (80% training, 20% validation).

2. Finding the Best PCA and KNN Parameters

  • A Pipeline is created with:
    • PCA() for dimensionality reduction.
    • KNeighborsRegressor() for regression.
  • GridSearchCV is used to search for:
    • The best number of PCA components (pca__n_components).
    • The best number of KNN neighbors (knn__n_neighbors).
  • The best hyperparameters are selected based on Root Mean Squared Error (RMSE).

3. Finding the Best PCA and Decision Tree Parameters

  • A similar Pipeline is created with:
    • PCA() for dimensionality reduction.
    • DecisionTreeRegressor() as the model.
  • GridSearchCV searches for:
    • The best number of PCA components (pca__n_components).
    • The best tree depth (dt__max_depth).
  • The best parameters are chosen based on RMSE.

4. Transforming Data Using Best PCA Settings

  • Two separate PCA transformations are applied:
    • pca_knn with the best PCA components for KNN.
    • pca_dt with the best PCA components for Decision Tree.
  • The transformed datasets (X_knn and X_dt) are ready for model training.

Key Takeaways

  • Different PCA settings are used for KNN and Decision Tree models.
  • The best hyperparameters are selected using cross-validation.
  • PCA helps improve model performance by reducing dimensionality.
In [79]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV


target_column = 'SalePrice'  # target column

# Define the feature columns (everything except the target column)
X = train_df.drop(columns=[target_column])  # Features
t = train_df[target_column]                # Target
X = StandardScaler().fit_transform(X)

X_train, X_val, t_train, t_val = train_test_split(X, t, test_size=0.2, random_state=42)

### 1. Find Best PCA + KNN Parameters ###
knn_pipe = Pipeline([
    ('pca', PCA()),
    ('knn', KNeighborsRegressor())
])

knn_param_grid = {
    'pca__n_components': range(1, 20),  # PCA components
    'knn__n_neighbors': range(1, 30)    # KNN neighbors
}

knn_grid_search = GridSearchCV(knn_pipe, knn_param_grid, cv=5, scoring='neg_root_mean_squared_error')
knn_grid_search.fit(X, t)
# X_train, t_train
# Get Best PCA settings for KNN
best_pca_knn = knn_grid_search.best_params_['pca__n_components']
best_k = knn_grid_search.best_params_['knn__n_neighbors']
print("Best KNN Parameters:", best_k)
print("Best KNN Score (RMSE):", -knn_grid_search.best_score_)

### 2. Find Best PCA + Decision Tree Parameters ###
dt_pipe = Pipeline([
    ('pca', PCA()),
    ('dt', DecisionTreeRegressor(random_state=42))
])

dt_param_grid = {
    'pca__n_components': range(1, 20),  # PCA components
    'dt__max_depth': range(1, 20)  # Max depth of the tree   
}

dt_grid_search = GridSearchCV(dt_pipe, dt_param_grid, cv=5, scoring='neg_root_mean_squared_error')
dt_grid_search.fit(X, t)

# Get Best PCA settings for Decision Tree
best_pca_dt = dt_grid_search.best_params_['pca__n_components']
best_depth = dt_grid_search.best_params_['dt__max_depth']

print("Best Decision Tree depth:", best_depth)
print("Best Decision Tree Score (RMSE):", -dt_grid_search.best_score_)

# 3. Transform Data Using Best PCA for Each Model
pca_knn = PCA(n_components=best_pca_knn)
pca_dt = PCA(n_components=best_pca_dt)

X_knn = pca_knn.fit_transform(X)

X_dt = pca_dt.fit_transform(X)
Best KNN Parameters: 6
Best KNN Score (RMSE): 23518.990379725816
Best Decision Tree depth: 6
Best Decision Tree Score (RMSE): 26211.621355482264

Model Evaluation with Bagging and Boosting: KNN and Decision Tree

This script evaluates Bagging and Boosting applied to K-Nearest Neighbors (KNN) and Decision Tree regressors. The models are assessed through bootstrap resampling, with 50 iterations to compute Root Mean Squared Error (RMSE) and R² scores for both training and validation sets.

Key Steps:

  1. Model Setup:

    • KNN and Decision Tree regressors are initialized with optimal hyperparameters (best_k and best_depth).
    • Both models are wrapped in Bagging and Boosting estimators for ensemble learning.
  2. Bootstrap Resampling:

    • In each of the 50 iterations, the data is resampled (with replacement for training and without for validation).
    • The models are trained on the resampled training set and evaluated on the validation set.
  3. Performance Metrics:

    • For each model, RMSE is computed for both the training and validation sets to measure model error.
    • R² scores are calculated to evaluate how well the models fit the data.
  4. Results:

    • The average RMSE and R² for the training and validation sets are computed and displayed.
    • A plot comparing training vs. validation loss across iterations is generated to visualize model performance.

This approach helps assess the stability and generalization capabilities of the models under different resampling scenarios.

In [80]:
from sklearn.utils import resample
from sklearn.metrics import r2_score
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import RandomForestRegressor

# models
knn = KNeighborsRegressor(n_neighbors=best_k)
dt = DecisionTreeRegressor(max_depth=best_depth)

# bagging
knn_bagging = BaggingRegressor(estimator=knn, n_estimators=50, random_state=42, bootstrap=True)
dt_bagging = BaggingRegressor(estimator=dt, n_estimators=50, random_state=42, bootstrap=True)

# boosting
knn_boosting = AdaBoostRegressor(estimator=knn, n_estimators=50, random_state=42)
dt_boosting = AdaBoostRegressor(estimator=dt, n_estimators=50, random_state=42)

rf_model = RandomForestRegressor(criterion='squared_error', n_estimators=100, random_state=42)

estimators = {
    "KNN_Bagging" : knn_bagging,
    "dt_bagging" : dt_bagging,
    "KNN_Boosting": knn_boosting,
    "dt_boosting" : dt_boosting,
    "Random_Forest" : rf_model
}

n_iterations = 50

for model_name, model in estimators.items():
    print(f"\n{'-' * 20}\nModel: {model_name}\n{'-' * 20}")
    
    train_losses = []
    val_losses = []
    train_r2_scores = []
    val_r2_scores = []

    if "KNN" in model_name:
        X_data = X_knn
    else:  # Decision Tree models
        X_data = X_dt
   
    for iteration in range(n_iterations):
        X_train_sample, t_train_sample = resample(X_data, t, replace=True)
        X_val_sample, t_val_sample = resample(X_data, t, replace=False)

        # Train the model
        model.fit(X_train_sample, t_train_sample)

        # Compute RMSE for train and validation
        train_loss = np.sqrt(mean_squared_error(t_train_sample, model.predict(X_train_sample)))
        val_loss = np.sqrt(mean_squared_error(t_val_sample, model.predict(X_val_sample)))

        train_losses.append(train_loss)
        val_losses.append(val_loss)

         # Compute R² for train and validation
        train_r2 = model.score(X_train_sample, t_train_sample)  # R² on training data
        val_r2 = model.score(X_val_sample, t_val_sample)  # R² on validation data

        train_r2_scores.append(train_r2)
        val_r2_scores.append(val_r2)
        
    # Compute averages
    mean_train_loss = np.mean(train_losses)
    mean_val_loss = np.mean(val_losses)
    mean_train_r2 = np.mean(train_r2_scores)
    mean_val_r2 = np.mean(val_r2_scores)

    
    print(f"\nResults for {model_name}:")
    print(f"Average Training Loss (RMSE): {mean_train_loss:.4f}")
    print(f"Average Validation Loss (RMSE): {mean_val_loss:.4f}")
    print(f"Average Training R²: {mean_train_r2:.4f}")
    print(f"Average Validation R²: {mean_val_r2:.4f}")


    # Plot training vs validation loss
    plt.figure(figsize=(10, 5))
    plt.plot(train_losses, label="Training Loss")
    plt.plot(val_losses, label="Validation Loss")
    plt.xlabel('Bootstrap Iterations')
    plt.ylabel('Loss')
    plt.legend()
    plt.title(f'{model_name} - Training vs Validation Loss')
    plt.show()
--------------------
Model: KNN_Bagging
--------------------

Results for KNN_Bagging:
Average Training Loss (RMSE): 16962.5839
Average Validation Loss (RMSE): 20958.4550
Average Training R²: 0.9175
Average Validation R²: 0.8747
--------------------
Model: dt_bagging
--------------------

Results for dt_bagging:
Average Training Loss (RMSE): 14654.3498
Average Validation Loss (RMSE): 19084.0792
Average Training R²: 0.9387
Average Validation R²: 0.8961
--------------------
Model: KNN_Boosting
--------------------

Results for KNN_Boosting:
Average Training Loss (RMSE): 11824.2573
Average Validation Loss (RMSE): 19608.1535
Average Training R²: 0.9602
Average Validation R²: 0.8902
--------------------
Model: dt_boosting
--------------------

Results for dt_boosting:
Average Training Loss (RMSE): 11661.0580
Average Validation Loss (RMSE): 17038.6330
Average Training R²: 0.9613
Average Validation R²: 0.9171
--------------------
Model: Random_Forest
--------------------

Results for Random_Forest:
Average Training Loss (RMSE): 5652.2525
Average Validation Loss (RMSE): 15018.5026
Average Training R²: 0.9908
Average Validation R²: 0.9356

Explanation of Large Differences Between Train and Validation Performance

We observe noticeable differences between training and validation performance for all models:

  • Bagging KNN:

    • Training Loss (RMSE): 18,263.75
    • Validation Loss (RMSE): 21,580.38
    • Training R²: 0.9049
    • Validation R²: 0.8671
  • Bagging Decision Tree:

    • Training Loss (RMSE): 14,792.99
    • Validation Loss (RMSE): 18,956.28
    • Training R²: 0.9375
    • Validation R²: 0.8975
  • Boosting KNN:

    • Training Loss (RMSE): 13,214.99
    • Validation Loss (RMSE): 19,915.58
    • Training R²: 0.9499
    • Validation R²: 0.8868
  • Boosting Decision Tree:

    • Training Loss (RMSE): 11,662.96
    • Validation Loss (RMSE): 17,078.59
    • Training R²: 0.9610
    • Validation R²: 0.9167
  • Random Forest:

    • Training Loss (RMSE): 5,697.46
    • Validation Loss (RMSE): 14,987.11
    • Training R²: 0.9906
    • Validation R²: 0.9359

Possible Reasons

1. Overfitting

  • Random Forest and Boosting Decision Trees show significant differences between training and validation, indicating overfitting due to deep trees.
  • Boosting KNN and Bagging Decision Tree models also exhibit some degree of overfitting, though less severe.

2. High Variance Models

  • Bagging helps reduce variance, but individual trees can still overfit if not properly regularized.
  • Boosting can overfit if the model corrects even minor errors, making it sensitive to noise in the training data.

3. Data Distribution Differences

  • If the train and validation sets have different distributions, models may not generalize well.

Random Forest Model Evaluation

This code trains a Random Forest Regressor on PCA-transformed data, using the squared_error criterion. It evaluates the model's performance on the validation set by calculating RMSE and R² scores.

Why we didn't use LWLR:

Locally Weighted Linear Regression (LWLR) is not suitable for a house price prediction competition for several reasons:

  1. Computational Cost – LWLR is non-parametric, meaning it computes weights and fits a model for each prediction. This is computationally expensive, especially with large datasets like house price prediction.
  2. Scalability Issues – Since LWLR recalculates weights for each prediction, it does not scale well for datasets with thousands or millions of houses.
  3. High-Dimensional Data – House price data often has many features (e.g., location, square footage, number of rooms). LWLR struggles with high-dimensional data, leading to overfitting and inefficiency.
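To make the computational-cost point concrete, here is a minimal LWLR sketch (a hypothetical helper, not part of the notebook): every single prediction solves its own weighted least-squares system, which is what makes the method expensive at scale.

```python
import numpy as np

def lwlr_predict(x_query, X, y, tau=1.0):
    """Predict at one query point with locally weighted linear regression.
    A fresh weighted least-squares problem is solved per query, costing
    roughly O(n*d^2 + d^3) work for every single prediction."""
    # Gaussian kernel weights centered on the query point
    diff = X - x_query
    w = np.exp(-np.sum(diff**2, axis=1) / (2 * tau**2))
    # Add an intercept column
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    W = np.diag(w)
    # Solve the weighted normal equations (Xb^T W Xb) theta = Xb^T W y
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return np.array([1.0, *x_query]) @ theta

# Tiny demo: y ~ 2x with noise; each call to lwlr_predict re-solves the system
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 * X[:, 0] + rng.normal(0, 0.1, 200)
pred = lwlr_predict(np.array([5.0]), X, y, tau=1.0)
```

For 1,460 training rows this is tolerable, but predicting an entire test set means re-solving the system once per test house, unlike a parametric model that is fit once.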

Train the Linear Regression Model

  • Purpose: Fit a linear regression model on the scaled training data.
  • Process:
    • LinearRegression(): Instantiates a simple linear regression model.
    • fit(X_train_scaled, y_train): Trains the model using scaled training data and target values.

Evaluate the Model

  • Purpose: Measure the model's performance on training and validation sets.
  • Process:
    • NE_reg.score: Computes the R² score:
      • Training R²: Measures how well the model fits the training data.
      • Validation R²: Measures how well the model generalizes to unseen (validation) data.
In [82]:
# Step 2: Train the model
model = LinearRegression()
NE_reg = model.fit(X_train, t_train)

# calculate R2 score for each group
print('R2 score on train', NE_reg.score(X_train, t_train))
print('R2 score on validation', NE_reg.score(X_val, t_val))
R2 score on train 0.858029371226417
R2 score on validation 0.8705569645275424

Calculate MSE and RMSE

In [83]:
# calculate MSE and RMSE

y_train = NE_reg.predict(X_train)
y_val = model.predict(X_val)
print('MSE on train', metrics.mean_squared_error(t_train, y_train))
print('MSE on validation', metrics.mean_squared_error(t_val, y_val))
print()
print('RMSE on train', metrics.mean_squared_error(t_train, y_train, squared=False))
print('RMSE on validation', metrics.mean_squared_error(t_val, y_val, squared=False))
MSE on train 508392483.20423704
MSE on validation 411734659.2209544

RMSE on train 22547.56047124028
RMSE on validation 20291.24587650927

Competition

In [87]:
############################# OUTPUT FOR HOMEWORK 1, LINEAR REGRESSION:

# from sklearn.pipeline import Pipeline
# from sklearn.preprocessing import StandardScaler
# from sklearn.linear_model import LinearRegression
# import pandas as pd

# # Example pipeline (Ensure pipeline is defined and trained)
# pipeline_model = Pipeline([
#     ('scaler', StandardScaler()),
#     ('regressor', LinearRegression())
# ])

# # Train the pipeline
# pipeline_model.fit(X_train, y_train)

# # Predict on test data
# # X_test = test_encoded[high_corr_features]  # Ensure these variables are defined
# predictions = pipeline_model.predict(test_df)

# # Create the submission file
# output = pd.DataFrame({'Id': id_test,  # Ensure test_ID is defined
#                        'SalePrice': predictions})

# output.to_csv('submission.csv', index=False)

# print("Saved the predictions to a .csv file")

################################ OUTPUT FOR HOMEWORK 4, DECISION TREE:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

# Fit the scaler on the training features and reuse it on the test data
scaler = StandardScaler().fit(train_df.drop(columns=['SalePrice']))
test_df_scaled = scaler.transform(test_df)
# Reuse the PCA already fitted on the training data (transform, not fit_transform)
test_df_pca = pca_dt.transform(test_df_scaled)
predictions = dt_boosting.predict(test_df_pca)

# Create the submission file
output = pd.DataFrame({'Id': id_test,  # Ensure id_test is defined
                       'SalePrice': predictions})

output.to_csv('submission.csv', index=False)

print("Saved the predictions to a .csv file")
Saved the predictions to a .csv file

Linear Regression performance

The code implements a function to evaluate Linear Regression performance across various splits of the data, plotting the results for Mean Squared Error (MSE) and R² score for both training and validation sets.

In [85]:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn import linear_model
from sklearn import model_selection
import matplotlib.pyplot as plt

# Plot the MSE and R2 curves with matplotlib
def print_graphs_r2_mse(graph_points):
    for k, v in graph_points.items():
        # Print the raw values to verify they change across splits
        print(f'{k}: {v}')

        # Find the best value of each curve (max for R2, min for MSE)
        best_value = max(v.values()) if 'R2' in k else min(v.values())
        best_index = np.argmax(list(v.values())) if 'R2' in k else np.argmin(list(v.values()))
        best_x = list(v.keys())[best_index]  # the test-size fraction at the best value
        color = 'red' if 'train' in k else 'blue'

        # Draw the curve
        plt.plot(list(v.keys()), list(v.values()), color=color, label=k)

        # Add a title marking the best point
        plt.title(f'{k}, best value: x={best_x}, y={best_value}')
        plt.xlabel('Test size (fraction)')
        plt.ylabel('Score (MSE/R2)')
        plt.legend()
        plt.show()

# Evaluate the model across different train/validation splits
def plot_score_and_loss_by_split(X, t):
    graph_points = {
        'train_MSE': {},
        'val_MSE': {},
        'train_R2': {},
        'val_R2': {}
    }

    for size in range(10, 100, 10):  # loop over test sizes (10% to 90%)
        X_train, X_val, t_train, t_val = model_selection.train_test_split(
            X.values, t.values, test_size=size/100, random_state=42)
        
        # Train the model
        NE_reg = linear_model.LinearRegression().fit(X_train, t_train)
        
        # Compute predictions on both splits
        y_train = NE_reg.predict(X_train)
        y_val = NE_reg.predict(X_val)
        
        # Record MSE and R2 for this split
        graph_points['train_MSE'][size/100] = metrics.mean_squared_error(t_train, y_train)
        graph_points['val_MSE'][size/100] = metrics.mean_squared_error(t_val, y_val)
        graph_points['train_R2'][size/100] = NE_reg.score(X_train, t_train)
        graph_points['val_R2'][size/100] = NE_reg.score(X_val, t_val)
    
    # Plot the graphs
    print_graphs_r2_mse(graph_points)

# Run on the prepared features and target
plot_score_and_loss_by_split(train_df.drop(columns=['SalePrice']), train_df['SalePrice'])
train_MSE: {0.1: 491258889.1547774, 0.2: 508392483.20423704, 0.3: 511939935.1603503, 0.4: 519078175.4552234, 0.5: 504888107.73232025, 0.6: 513924486.5376029, 0.7: 523289455.8370989, 0.8: 465080812.726987, 0.9: 328690935.81812215}
val_MSE: {0.1: 435342637.3464779, 0.2: 411734659.22097474, 0.3: 498522530.89003783, 0.4: 514517686.45261925, 0.5: 579953362.6951003, 0.6: 643808813.7654803, 0.7: 693390292.1352198, 0.8: 930683402.1242023, 0.9: 1465329826.7272117}
train_R2: {0.1: 0.8601420393472337, 0.2: 0.858029371226417, 0.3: 0.8585956846050147, 0.4: 0.8585901147430888, 0.5: 0.86186963420368, 0.6: 0.8508781912118052, 0.7: 0.858359528963432, 0.8: 0.8743591875527014, 0.9: 0.9138717778708852}
val_R2: {0.1: 0.87347126689771, 0.2: 0.870556964527536, 0.3: 0.8452721541035995, 0.4: 0.8414499292736906, 0.5: 0.8271040621490705, 0.6: 0.8183472100589478, 0.7: 0.7973878080464568, 0.8: 0.7307738079137898, 0.9: 0.5778406187993963}

The output represents the training and validation performance metrics for different test sizes (from 10% to 90% of the dataset).

Metrics:

  1. MSE (Mean Squared Error):

    • train_MSE: Measures training error (lower is better). Generally stable, but decreases significantly at 90% test size, suggesting overfitting.
    • val_MSE: Measures validation error. Increases with larger test sizes, indicating poorer generalization for smaller training sets.
  2. R² (Coefficient of Determination):

    • train_R2: Measures goodness of fit on training data. Stays high, peaking at 90% test size, which may indicate overfitting.
    • val_R2: Measures goodness of fit on validation data. Declines as test size increases, suggesting reduced model performance.

Key Insights:

  • Small test sizes (10-30%):
    • Good balance between train and validation performance, suggesting optimal splits.
  • Large test sizes (80-90%):
    • Overfitting emerges as the training error decreases (low train_MSE) and validation error increases (high val_MSE, low val_R²).

This indicates the model generalizes best with a moderate split (e.g., 10-30% test size) for this dataset.
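
The experiment above can be sketched as follows. This is a minimal, self-contained version of the split-size loop: synthetic data stands in for the real `X` and `t` from `train_df`, and the metric names mirror the printed dictionaries.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# synthetic stand-in for the real features and target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
t = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)

results = {}
for size in range(10, 100, 10):  # test sizes from 10% to 90%
    X_train, X_val, t_train, t_val = train_test_split(
        X, t, test_size=size / 100, random_state=42)
    reg = LinearRegression().fit(X_train, t_train)
    results[size / 100] = {
        "train_MSE": mean_squared_error(t_train, reg.predict(X_train)),
        "val_MSE": mean_squared_error(t_val, reg.predict(X_val)),
        "train_R2": reg.score(X_train, t_train),
        "val_R2": reg.score(X_val, t_val),
    }
```

With real housing data the same loop reproduces the pattern discussed above: validation metrics degrade as the training fraction shrinks.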

Project Summary

In this project, we developed a predictive model for house prices using a combination of machine learning techniques. Initially, we built a linear regression model and later expanded the approach by incorporating K-Nearest Neighbors (KNN), Decision Tree (both with bagging and boosting), and Random Forest models to improve predictive accuracy. The workflow involved thorough data exploration, preprocessing, feature engineering, and model evaluation to optimize performance.


What We Did

  1. Data Exploration:

    • Investigated dataset structure, identified numerical and categorical features, and examined missing values.
    • Used statistical summaries and visualizations to understand relationships between variables and inform preprocessing decisions.
  2. Handling Missing Data:

    • Numerical features: Imputed missing values using the mean.
    • Categorical features: Replaced missing values with the most frequent category.
  3. Outlier Detection and Removal:

    • Applied the IQR method to identify and remove extreme values, adjusting thresholds to minimize data loss.
  4. Feature Engineering:

    • Identified key predictors based on correlation analysis.
    • Removed features with low contribution to the target variable (SalePrice).
  5. Graphical Analysis:

    • Correlation Analysis: Examined numerical feature correlations with house prices.
    • Boxplots & Violin Plots: Analyzed categorical features (e.g., Neighborhood) for price variations.
    • Regression Plots: Visualized relationships between numerical predictors and the target variable.
  6. Encoding Categorical Variables:

    • Applied One-Hot Encoding for nominal variables and Label Encoding for ordinal ones.
    • Handled unseen categories in test data appropriately.
  7. Model Development & Evaluation:

    • Implemented multiple machine learning models:
      • Linear Regression: Baseline model.
      • K-Nearest Neighbors (KNN): Evaluated performance with different k-values and applied bagging and boosting techniques.
      • Decision Tree: Applied bagging and boosting techniques.
      • Random Forest: Utilized an ensemble approach for improved generalization.
    • Used GridSearchCV to find the best PCA components and optimal hyperparameters for KNN and Decision Tree models.
    • Split data into 80% training and 20% validation sets.
    • Evaluated models using MSE, RMSE, and R² scores.
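
Steps 2, 3, and 6 above can be sketched in a few lines. This is a toy illustration with hypothetical values for two real dataset columns (`GrLivArea`, `Neighborhood`), not the notebook's actual preprocessing code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "GrLivArea": [1500, 1700, np.nan, 9000, 1600],
    "Neighborhood": ["StoneBr", np.nan, "NridgHt", "StoneBr", "StoneBr"],
})

# step 2: mean imputation for numeric, most-frequent for categorical
df["GrLivArea"] = SimpleImputer(strategy="mean").fit_transform(
    df[["GrLivArea"]]).ravel()
df["Neighborhood"] = SimpleImputer(strategy="most_frequent").fit_transform(
    df[["Neighborhood"]]).ravel()

# step 3: IQR outlier removal on the numeric column
q1, q3 = df["GrLivArea"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["GrLivArea"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# step 6: one-hot encode the nominal column; handle_unknown="ignore"
# covers categories that appear only in the test set
ohe = OneHotEncoder(handle_unknown="ignore")
encoded = ohe.fit_transform(df[["Neighborhood"]]).toarray()
```

The 9000 sq ft row is dropped by the IQR filter, and the encoder yields one column per remaining neighborhood category.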

Insights & Observations

  • Strong Predictors: Features such as OverallQual, GrLivArea, and GarageCars had high correlations with house prices.
  • Neighborhood Effect: Some areas (e.g., StoneBr, NridgHt) exhibited significantly higher average prices.
  • Ensemble Methods Improve Performance: Random Forest and boosting techniques outperformed linear regression by reducing errors and capturing complex patterns.

What Worked Well

  • Expanding Beyond Linear Regression: Adding ensemble models and KNN with bagging and boosting improved predictive accuracy.
  • Feature Engineering: Selectively including high-impact features enhanced model performance.
  • Handling Missing Data: Addressing missing values systematically helped maintain data integrity.
  • Hyperparameter Optimization: GridSearchCV helped find the best PCA components and parameters for KNN and Decision Tree, improving model efficiency.
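
The GridSearchCV setup described above can be sketched like this: a pipeline chaining PCA and KNN, with a joint search over component counts and neighbor counts. Synthetic data and the specific grid values are illustrative assumptions, not the notebook's actual configuration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the preprocessed feature matrix
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 8))
y = X[:, 0] * 5 + rng.normal(scale=0.5, size=120)

# search jointly over PCA dimensionality and KNN neighbors
pipe = Pipeline([("pca", PCA()), ("knn", KNeighborsRegressor())])
param_grid = {
    "pca__n_components": [2, 4, 6],
    "knn__n_neighbors": [3, 5, 7],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X, y)
best = search.best_params_  # best PCA components and k found by CV
```

Searching the pipeline as a whole ensures PCA is refit inside each cross-validation fold, avoiding leakage from the held-out fold into the projection.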

Challenges & Improvements

  • Outlier Sensitivity: Overly aggressive outlier removal sometimes led to data loss.
  • Categorical Encoding Issues: Handling unseen categories in test data required adjustments.
  • Further Hyperparameter Tuning: Additional fine-tuning could further enhance performance.

Conclusion

By incorporating multiple machine learning models, this project demonstrated the impact of ensemble learning on predictive accuracy. Random Forest, KNN with bagging and boosting, and boosted Decision Trees provided the best performance, highlighting the benefits of leveraging diverse algorithms. The use of GridSearchCV for optimizing PCA and hyperparameters significantly improved model efficiency. Future work could focus on further fine-tuning and exploring additional feature selection techniques to enhance the model.

Our Score:

WhatsApp image, 2024-12-12 at 19.24.38 (93f9286c.jpg)

WhatsApp image, 2024-12-12 at 19.28.32 (0ffcbbee.jpg)